Abstract

This paper presents part of an ongoing project
to integrate perception, attention, drives, emotions,
behavior arbitration, and expressive acts
for a robot designed to interact socially with
humans. We present the design of a visual attention
system based on a model of human visual
search behavior from Wolfe (1994). The
attention system integrates perceptions (motion
detection, color saliency, and face pop-outs)
with habituation effects and influences
from the robot’s motivational and behavioral
state to create a context-dependent attention
activation map. This activation map is used to
direct eye movements and to satiate the drives
of the motivational system.

1  Introduction 

Socially intelligent robots provide both a natural human-machine
interface and a mechanism for bootstrapping
more complex behavior. However, social skills often require
complex perceptual, motor, and cognitive abilities
[Brooks et al., 1998]. Our research has focused on a
developmental approach to building socially intelligent
robots that utilize natural human social cues to interact
with and learn from human caretakers.

This paper discusses the construction of one necessary
sub-system for social intelligence: an attention system.
To provide a basis for more complex social behaviors,
an attention system must direct limited computational
resources and select among potential behaviors by combining
perceptions from a variety of modalities with the
existing motivational and behavioral state of the robot.
We present a robotic implementation of an attention system
based upon models of human attention and visual
search. We further outline the ways in which this model
interacts with existing perceptual, motor, motivational,
and behavioral systems.

Our implementation is based upon Wolfe’s model of
human visual attention and visual search [Wolfe, 1994].
This model integrates evidence from Treisman [1985],
Julesz [1988], and others to construct a flexible model
of human visual search behavior. In Wolfe’s model, visual
stimuli are filtered by broadly-tuned “categorical”
channels (such as color and orientation) to produce feature
maps with activation based upon both local regions
(bottom-up) and task demands (top-down). The feature
maps are combined by a weighted sum to produce an
activation map. Limited cognitive and motor resources
are distributed in order of decreasing activation. This
model has been tested in simulation, and yields results
that are similar to those observed in human subjects
[Wolfe, 1994]. In this paper we do not attempt to match
human performance (a task that is difficult with current
component technology), but rather require only that
the robotic system perform enough like a human that it
is capable of maintaining a normal social interaction.
Our implementation is similar to other models based in
part on Wolfe’s work [Itti et al., 1998; Hashimoto, 1998;
Driscoll et al., 1998], but it additionally operates in
conjunction with motivational and behavioral models and
with moving cameras, and it differs in how it deals with
habituation.

2  Robot  Hardware 

Our robotic platform consists of a stereo active vision
system (described in [Scassellati, 1998a]) augmented
with facial features for emotive expression. The robot,
called Kismet and shown in Figure 1, is able to show
expressions (analogous to anger, fatigue, fear, disgust,
excitement, happiness, interest, sadness, and surprise)
which are easily interpreted by an untrained human observer.
The platform has four degrees of freedom in the
vision system; each eye has an independent vertical axis
of rotation (pan) and the eyes share a joint horizontal
axis of rotation (tilt). Kismet also has fifteen degrees of
freedom in facial features, including eyebrows, ears, eyelids,
lips, and a mouth. Each eyeball has an embedded
color CCD camera with a 5.6 mm focal length.

The active vision platform is attached to a parallel network
of eight 50 MHz digital signal processors (Texas Instruments
TMS320C40). The DSP network serves as the
sensory processing engine and implements the bulk of the
robot’s perception and attention systems. A pair of Motorola
68332-based microcontrollers is also connected
to the robot. One controller implements the motor system
for driving the robot’s facial motors. The other
controller implements the motivational system (emotions
and drives) and the behavior system. The microcontrollers
communicate with the DSP network through a
dual-ported RAM.

Figure 1: Kismet, a robot designed to interact socially
with humans. Kismet has an active vision system and
can display a variety of facial expressions.

3  Perceptual  Systems 

Our current perceptual systems focus on the pre-attentive,
massively parallel stage of human vision that
processes information about basic visual features (color,
motion, various depth cues, etc.). The implementation
described here focuses on three such pre-attentive processes:
color, motion, and face pop-outs. In terms of the
model from Wolfe [1994], our implementation contains
the bottom-up feature maps, which represent the inherent
saliency of a specific image property for each point
in the visual scene, and incorporates top-down influences
from motivational and behavioral sources.

The video signal from each of Kismet’s cameras is digitized
by one of the DSP nodes with specialized frame
grabbing hardware. The image is then subsampled and
averaged to an appropriate size. For these initial tests,
we have used an image size of 64 x 64, which allows us
to complete all of the processing in near real-time. To
minimize latency, each feature map is computed by a separate
DSP processor (each of which also has additional
computational task load). All of the feature detectors
discussed here can operate at multiple scales.

3.1  Color  Saliency  Feature  Maps 

One of the most basic and widely recognized visual features
is color. Our models of color saliency are drawn
from the complementary work on visual search and attention
from Itti, Koch, and Niebur [1998]. The incoming
video stream contains three 8-bit color channels (r, g,
and b) which are transformed into four color-opponency
channels (r', g', b', and y'). Each input color channel is
first normalized by the luminance l (a weighted average
of the three input color channels):

These normalized color channels (r_n, g_n, and b_n) are then
used to produce four opponent-color channels:


$$r' = r_n - \frac{g_n + b_n}{2} \tag{2}$$

$$g' = g_n - \frac{r_n + b_n}{2} \tag{3}$$

$$b' = b_n - \frac{r_n + g_n}{2} \tag{4}$$

$$y' = \frac{r_n + g_n}{2} - b_n - \lVert r_n - g_n \rVert \tag{5}$$


The four opponent-color channels are clamped to 8-bit
values by thresholding. While some research seems to
indicate that each color channel should be considered individually
[Nothdurft, 1993], we choose to maintain all
of the color information in a single feature map to simplify
the processing requirements (as does Wolfe [1994]
for more theoretical reasons). The maximum of the four
opponent-color values is computed and then smoothed
with a uniform 5 x 5 field to produce the output color
saliency feature map. This smoothing serves both to
eliminate pixel-level noise and to provide a neighborhood
of influence to the output map, as proposed by Wolfe
[1994]. A single DSP node performs these computations
and forwards the resulting feature map to the
attention process at a rate of 20-25 Hz. The processor
produces a pseudo-color image by scaling the luminance
of the original image by the output saliency while retaining
the same relative chrominance (as shown in Figure
2).
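
The color pipeline described above (opponent channels, clamping, per-pixel maximum, 5 x 5 smoothing) can be sketched as follows. Two parts are assumptions rather than the implementation described here: the luminance is taken as the unweighted mean of r, g, and b, and the normalization scales each channel by 255 / (3 l) so that normalized values stay in the 8-bit range. The function names and the handling of image borders are likewise illustrative.

```python
def clamp8(value):
    """Clamp a number to the 8-bit range 0..255."""
    return max(0, min(255, int(value)))

def opponent_channels(r, g, b):
    """Return clamped (r', g', b', y') for one pixel, following equations 2-5."""
    lum = (r + g + b) / 3.0            # assumed: unweighted average as luminance
    if lum == 0:
        return (0, 0, 0, 0)
    scale = 255.0 / (3.0 * lum)        # assumed normalization constant
    rn, gn, bn = r * scale, g * scale, b * scale
    rp = rn - (gn + bn) / 2
    gp = gn - (rn + bn) / 2
    bp = bn - (rn + gn) / 2
    yp = (rn + gn) / 2 - bn - abs(rn - gn)
    return tuple(clamp8(v) for v in (rp, gp, bp, yp))

def box_smooth(grid, size=5):
    """Uniform size x size mean filter; windows are truncated at the borders."""
    h, w, k = len(grid), len(grid[0]), size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [grid[j][i]
                    for j in range(max(0, y - k), min(h, y + k + 1))
                    for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out

def color_saliency(rgb_image):
    """Per-pixel maximum of the four opponent values, then 5 x 5 smoothing."""
    raw = [[max(opponent_channels(*px)) for px in row] for row in rgb_image]
    return box_smooth(raw, 5)
```

With these assumptions a saturated red pixel yields a strong r' response and nothing else, while a neutral gray pixel yields zero in all four channels, which is the qualitative behavior the opponent channels are meant to provide.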


3.2  Motion  Saliency  Feature  Maps 

In parallel with the color saliency computations, a second
processor receives input images from the frame grabber
and computes temporal differences to detect motion.
The incoming image is converted to grayscale and placed
into a frame buffer ring. A raw motion map is computed
by passing the absolute difference between consecutive
images through a threshold function T:

$$M_{raw} = T\left(\lVert I_t - I_{t-1} \rVert\right) \tag{6}$$

This raw motion map is then smoothed with a uniform
7 x 8 field. While using a 5 x 5 field would have maintained
consistency with both Wolfe’s model and the color
saliency feature map, using a slightly larger field size allows
us to use the output of the motion saliency map as
a pre-filter to the face detection routine, which improved
its performance in prior tests by a factor of 3 [Scassellati,
1998b]. The motion saliency feature map is computed
at 25-30 Hz by a single DSP processor node and
forwarded both to the attention process and the VGA
display.
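
A minimal sketch of the motion computation, assuming grayscale frames stored as lists of rows. The class name `MotionDetector`, the choice of a threshold that zeroes small differences and passes larger ones unchanged, and the border handling of the smoothing filter are all assumptions; the form of T and its threshold value are not given in the text, so both the threshold and the smoothing field size are left as parameters.

```python
from collections import deque

def uniform_smooth(grid, kh, kw):
    """Mean filter with a kh x kw window, truncated at the image borders."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - kh // 2), min(h, y - kh // 2 + kh)
        for x in range(w):
            x0, x1 = max(0, x - kw // 2), min(w, x - kw // 2 + kw)
            vals = [grid[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            out[y][x] = sum(vals) // len(vals)
    return out

class MotionDetector:
    """Thresholded frame differencing over a two-frame ring buffer."""

    def __init__(self, threshold, field):
        self.threshold = threshold        # hypothetical value chosen by the caller
        self.field = field                # (rows, cols) of the smoothing window
        self.frames = deque(maxlen=2)     # ring buffer holding the latest two frames

    def update(self, gray):
        """Add a frame; return the smoothed motion map, or None for the first frame."""
        self.frames.append(gray)
        if len(self.frames) < 2:
            return None
        prev, curr = self.frames
        raw = [[d if d >= self.threshold else 0
                for d in (abs(c - p) for p, c in zip(prow, crow))]
               for prow, crow in zip(prev, curr)]
        return uniform_smooth(raw, *self.field)
```

A caller would construct one detector, for example `MotionDetector(threshold=16, field=(7, 8))` with a hypothetical threshold, and call `update` once per incoming frame.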


3.3  Face  Pop-Out  Feature  Maps 

While form and size are part of Wolfe’s original model,
we have extended the concept to include other known
pop-out features that have social relevance such as faces.
Our face detection techniques are designed to identify
locations that are likely to contain a face, not to verify
with certainty that a face is present in the image.
The face detector is based on the ratio-template technique
developed by Sinha [1996], and has been previously
reported [Scassellati, 1998b]. The ratio template algorithm
was designed to detect frontal views of faces under
varying lighting conditions, and is an extension of classical
template approaches [Sinha, 1996]. Ratio templates
also offer multiple levels of biological plausibility; templates
can be either hand-coded or learned adaptively
from qualitative image invariants [Sinha, 1996].

Figure 2: Overview of the attention system. A variety of visual feature detectors (color, motion, and face detectors)
combine with a habituation function to produce an attention activation map. The attention process influences eye
control and the robot’s internal motivational and behavioral state, which in turn influence the weighted combination
of the feature maps. Displayed images were captured during a behavioral trial session.
A ratio template is composed of regions and relations,
as shown to the left of the face detector in Figure 2. For
each target location in the grayscale peripheral image, a
template comparison is performed using a special set of
comparison rules. The set of regions is convolved with
a 14 x 16 image patch around a pixel location to give
the average grayscale value for each region. Relations
are comparisons between region values, for example, between
the “left forehead” region and the “left temple”
region. The relation is satisfied if the ratio of the first
region to the second region exceeds a constant value (in
our case, 1.1). The number of satisfied relations serves
as the match score for a particular location; the more
relations that are satisfied, the more likely that a face is
located there. In Figure 2, each arrow indicates a relation,
with the head of the arrow denoting the second
region (the denominator of the ratio).
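
The scoring rule for a single patch can be sketched as below. The region rectangles and relation list passed in are hypothetical placeholders supplied by the caller; the actual template layout is the one shown in Figure 2 and is not reproduced here. The comparison is written as `first > ratio * second`, which is the same test as the ratio exceeding 1.1 whenever the second region is non-zero and avoids a division by zero otherwise.

```python
def region_mean(patch, rect):
    """Average grayscale value of a rectangle (top, left, height, width)."""
    top, left, height, width = rect
    vals = [patch[y][x]
            for y in range(top, top + height)
            for x in range(left, left + width)]
    return sum(vals) / len(vals)

def match_score(patch, regions, relations, ratio=1.1):
    """Count the relations (first, second) whose region ratio exceeds `ratio`.

    `regions` maps a region name to a rectangle inside the patch, and
    `relations` is a list of (first_name, second_name) pairs.
    """
    means = {name: region_mean(patch, rect) for name, rect in regions.items()}
    score = 0
    for first, second in relations:
        if means[first] > ratio * means[second]:
            score += 1
    return score
```

Sliding this score over every candidate location, and keeping it as a per-pixel value, would give a map in which higher values mark locations more likely to contain a face.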

The ratio template algorithm has been shown to
be reasonably invariant to changes in illumination and
slight rotational changes [Scassellati, 1998b]. The ratio
template algorithm processes video streams in real time
using optimization and pre-filtering techniques, and the
system has been tested on a variety of lighting conditions
and subjects. The algorithm can operate on each
level of an image pyramid in order to detect faces at
multiple scales. In the current implementation, due to
limited processing capability, we elected to process only
a single scale for faces. Applied to a 64 x 64 image from
Kismet’s cameras, the 14 x 16 ratio template finds faces
in a range of approximately 3-6 feet from the robot. This
range was suitable for our current investigations of face-to-face
social interactions, and could easily be expanded
with additional processors. The implemented face detector
operates at approximately 15-20 Hz.

4  Behaviors  and  Motivations 

In previous work, Breazeal and Scassellati [1998] presented
how the design of Kismet’s motivation and behavior
systems enables it to socially interact with a human
while regulating the intensity of the interaction via
expressive displays. For the purposes of this paper, we
present only those aspects of these systems which bias
the robot’s attention (see Figure 3).

Perceptual stimuli are classified into social stimuli (i.e.,
people, which move and have faces), which satisfy a drive
to be social, and non-social stimuli (i.e., toys, which move
and are colorful), which satisfy a drive to be stimulated
by other things in the environment.

Figure 3: Schematic of motivations and behaviors relevant
to attention. See text for details.

For each drive, there is a desired operation point and
acceptable bounds of operation around that point
(the homeostatic regime). As long as a drive is within
the homeostatic regime, the corresponding need is being
adequately met. Unattended, drives drift toward
an under-stimulated regime. Excessive stimulation (too
many stimuli or stimuli moving too quickly) pushes a drive
toward an over-stimulated regime.
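
The three regimes can be illustrated with a toy drive model. Everything numeric here is hypothetical: the set point, the width of the homeostatic band, the drift per step, and the convention that negative values mean under-stimulation are assumptions chosen only to make the behavior visible.

```python
class Drive:
    """Toy drive with a homeostatic band around a set point."""

    def __init__(self, set_point=0.0, tolerance=0.3, drift=0.05, limit=1.0):
        self.set_point = set_point    # desired operation point
        self.tolerance = tolerance    # half-width of the homeostatic regime
        self.drift = drift            # per-step drift toward under-stimulation
        self.limit = limit            # hard bound on the drive level
        self.level = set_point

    def step(self, stimulation=0.0):
        """Advance one time step and return the resulting regime."""
        self.level += stimulation - self.drift
        self.level = max(-self.limit, min(self.limit, self.level))
        return self.regime()

    def regime(self):
        if self.level < self.set_point - self.tolerance:
            return "under-stimulated"
        if self.level > self.set_point + self.tolerance:
            return "over-stimulated"
        return "homeostatic"
```

Calling `step()` repeatedly with no stimulation lets the drive drift out of the homeostatic band, while a large `stimulation` value pushes it to the over-stimulated side.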

The robot’s drives influence behavior selection by preferentially
passing activation to select behaviors. By
doing so, the robot is more likely to activate behaviors
that serve to restore its drives to their homeostatic
regimes. The top level (level 0) of the behavior
system consists of a single cross-exclusion group
(CEG) containing two behaviors: satiate social and
satiate stimulation. Each behavior is viewed as a
self-interested, goal-directed process. Within a CEG,
behaviors compete for activation in a winner-take-all
scheme based upon perceptual factors, motivational factors,
and each behavior’s own persistence. Competition
between behaviors at the top level represents selection
at the task level. By organizing the top level behaviors
in this fashion, the robot can only act to restore one
drive at a time. This is reasonable since the satiating
stimuli for each drive are mutually exclusive and require
different behaviors. Specifically, whenever the satiate
social behavior wins, the robot’s task is to do what it
must to restore the social drive, and when the satiate
stimulation behavior wins, the robot’s task is to do
what it must to restore the stimulation drive.
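
A winner-take-all competition within one CEG might look like the sketch below. How the perceptual, motivational, and persistence factors are actually combined is not stated in the text; the simple sum and the size of the persistence bonus are assumptions, as is the function name.

```python
def winner_take_all(behaviors, active=None, persistence=0.1):
    """Pick the behavior with the highest activation in a cross-exclusion group.

    `behaviors` maps a behavior name to (perceptual, motivational) inputs.
    The currently active behavior, if any, receives a persistence bonus.
    """
    def activation(name):
        perceptual, motivational = behaviors[name]
        bonus = persistence if name == active else 0.0
        return perceptual + motivational + bonus   # assumed: simple sum

    return max(behaviors, key=activation)
```

With the persistence bonus, a behavior that is already active keeps control unless a competitor exceeds it by a clear margin, which reduces rapid switching between tasks.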

Each behavior node of the top level CEG has a child
CEG (level 1) associated with it. Once a level 0 behavior
wins the competition, it activates its child CEG at level
1. Subsequently, the behaviors within the active level
1 CEG compete for activation. Competition between
behaviors within the active level 1 CEG represents competition
at the strategy level. Each behavior has its own
distinct conditions for becoming relevant and winning
the competition. For instance, the avoid person behavior
is the most relevant when the robot’s social drive
is in the overwhelmed (over-stimulated) regime and a
person is stimulating the robot too vigorously. The goal
of this behavior is to reduce the intensity of stimulation.
If successful, the social drive will be restored to the
homeostatic regime. Similarly, the goal of the seek person
behavior is to acquire a social stimulus of reasonable
intensity. If successful, this will serve to restore the
social drive from the under-stimulated regime. The
engage person behavior is active by default (i.e., the
social drive is already in the homeostatic regime and the
robot is receiving a good quality stimulus).
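
The relevance conditions for the level 1 behaviors can be summarized as a small rule table. This is a deliberate simplification: in the system described here the behaviors compete for activation, whereas the sketch below replaces that competition with fixed rules. The function name, the regime strings, and the `target` parameter are hypothetical.

```python
def select_level1(drive_regime, stimulus_too_intense, target="person"):
    """Rule-based stand-in for the level 1 competition of one drive.

    drive_regime: "under-stimulated", "homeostatic", or "over-stimulated".
    stimulus_too_intense: True if the stimulus is overwhelming the robot.
    """
    if drive_regime == "over-stimulated" and stimulus_too_intense:
        return "avoid " + target      # reduce the intensity of stimulation
    if drive_regime == "under-stimulated":
        return "seek " + target       # acquire a stimulus of reasonable intensity
    return "engage " + target         # default when the need is being met
```

Passing `target="toy"` gives the corresponding choice for the stimulation drive.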

5  Attention  System 

The  attention  system  must  combine  the  various  effects 
of  the  perceptual  input  with  the  existing  motivational 
and  behavioral  state  of  the  robot  both  to  direct  limited 
computational  resources  and  to  select  among  potential 
behaviors.  Figure  2  shows  an  overview  of  the  attention 
system. 

5.1  Combining  Perceptual  Inputs 

Each of the feature maps contains an 8-bit value for each
pixel location which represents the relative presence of
that visual scene feature at that pixel. The attention
process combines each of these feature maps using a
weighted sum to produce an attention activation map
(using the terminology of Wolfe [1994]). The gains for
each feature map default to values of 200 for color, 40
for motion, and 50 for face detection. The attention
activation map is thresholded to remove noise values,
and normalized by the sum of the gains. Connected object
regions are extracted using a grow-and-merge procedure
with 8-connectivity. To further combine related
regions, any regions whose bounding boxes have a significant
overlap are also merged.
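
The combination step can be sketched as a gain-weighted sum followed by region extraction. Several details are assumptions: the noise threshold is applied to the normalized value rather than to the raw sum, its value is left to the caller, and a breadth-first flood fill with 8-connectivity stands in for the grow-and-merge procedure. The bounding-box merge is omitted.

```python
from collections import deque

def activation_map(maps, gains, noise_threshold=0.0):
    """Gain-weighted sum of feature maps, normalized by the sum of the gains."""
    total_gain = float(sum(gains[name] for name in maps))
    first = next(iter(maps.values()))
    h, w = len(first), len(first[0])
    act = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            value = sum(gains[n] * maps[n][y][x] for n in maps) / total_gain
            act[y][x] = value if value > noise_threshold else 0.0
    return act

def connected_regions(act):
    """Group non-zero pixels into regions using 8-connectivity (flood fill)."""
    h, w = len(act), len(act[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if act[y][x] <= 0 or seen[y][x]:
                continue
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                # Visit the 3 x 3 neighborhood, clipped to the image.
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if act[ny][nx] > 0 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(pixels)
    return regions
```

With the default gains of 200, 40, and 50, a pixel that is fully saturated in the color map alone reaches an activation of 200 x 255 / 290, or roughly 176.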

Statistics on each region are collected, including the
centroid, bounding box, area, average attention activation
score, and average score for each of the feature maps
in that region. The tagged regions that have an area
in excess of 30 pixels are sorted based upon their average
attention activation score. The attention process
provides the top three regions to both the eye motor
control system and the behavior and motivational systems.
The eye motor control system uses the centroid
of the most salient region to determine where to look
next. The top-down processes use the attention activation
score and the individual feature map scores of the
most salient region to determine which of the drives and
behaviors will become activated.
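
Region statistics and the selection of the top three regions can be sketched as follows. Regions are assumed to be lists of (row, column) pixel coordinates, as a labeling step might produce; the dictionary keys and function names are illustrative, and the per-feature-map averages are omitted for brevity.

```python
def region_stats(pixels, act):
    """Centroid (x, y), bounding box, area, and mean activation of one region."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    area = len(pixels)
    return {
        "centroid": (sum(xs) / area, sum(ys) / area),
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "area": area,
        "score": sum(act[y][x] for y, x in pixels) / area,
    }

def top_regions(regions, act, min_area=30, count=3):
    """Keep regions larger than min_area pixels, sorted by mean activation."""
    stats = [region_stats(pixels, act) for pixels in regions]
    large = [s for s in stats if s["area"] > min_area]
    return sorted(large, key=lambda s: s["score"], reverse=True)[:count]
```

The first entry of the returned list corresponds to the most salient region, whose centroid would be passed to eye motor control.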

5.2  Attention  Drives  Eye  Movement 

The eye motor control process acts on the data from
the attention process to center the eyes on an object
within the visual field. Our current implementation uses
a static linear mapping between image position and eye
position, which has been sufficient for our initial investigations.
We are currently in the process of converting
to a self-calibrated system that learns the sensori-motor
mapping for foveation similar to that described by
Scassellati [1998a].
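
A static linear mapping from image position to eye position could take the form below. The gains (motor units per pixel), the sign convention for tilt, and the choice of the image center as the zero point are hypothetical; the text states only that the mapping is static and linear.

```python
def image_to_eye(cx, cy, width=64, height=64, pan_gain=0.5, tilt_gain=0.5):
    """Map an image centroid (cx, cy) to pan and tilt offsets.

    Offsets are zero at the image center and grow linearly with the
    pixel distance from it. Image y grows downward, so tilt is negated.
    """
    pan = pan_gain * (cx - (width - 1) / 2)
    tilt = -tilt_gain * (cy - (height - 1) / 2)
    return pan, tilt
```

A learned sensori-motor mapping would replace the two fixed gains with values estimated from the robot's own eye movements.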

Each time that the eyes move, the eye motor process
sends two signals. The first signal inhibits the motion
detection system for approximately 600 msec, which prevents
self-motion from appearing in the motion feature
map. The second signal resets the habituation state,
which is described below.
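
The two signals can be modeled with a single timestamp, as in this sketch. The class and method names are hypothetical, and time is passed in explicitly (in seconds) so that the example does not depend on a real clock.

```python
class EyeMoveSignals:
    """Tracks the last eye movement to gate motion detection and habituation."""

    INHIBIT_SECONDS = 0.6   # approximately 600 msec of motion inhibition

    def __init__(self):
        self.last_move = None

    def eyes_moved(self, now):
        """Record an eye movement: starts inhibition and resets habituation."""
        self.last_move = now

    def motion_enabled(self, now):
        """True once the inhibition window after the last movement has passed."""
        if self.last_move is None:
            return True
        return now - self.last_move >= self.INHIBIT_SECONDS

    def time_since_reset(self, now):
        """Elapsed time since the last habituation reset (0 if none yet)."""
        return 0.0 if self.last_move is None else now - self.last_move
```

The value returned by `time_since_reset` is the elapsed time that a habituation function would consume.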




Figure  4:  Changes  of  the  face,  motion,  and  color  gains 
from  top-down  motivational  and  behavioral  influences 
(top).  When  the  social  drive  is  activated  by  face  stimuli 
(middle),  the  face  gain  is  influenced  by  the  seek  people 
and  avoid  people  behaviors.  When  the  stimulation 
drive  is  activated  by  color  stimuli  (bottom),  the  color 
gain  is  influenced  by  the  seek  toys  and  avoid  toys 
behaviors.  All  plots  show  the  same  4  minute  period. 


5.3  Habituation 

For our robot, the current object under consideration is
always the object that is in the center of the visual field.¹

¹ This is extremely relevant on our other robotic platforms,
which have a second camera that captures a high-resolution
foveal image.


The habituation function can be viewed as a feature map
that initially maintains eye fixation by increasing the
saliency of the center of the field of view and slowly decays
the saliency values of central objects until a salient
off-center object causes the eyes to move. The habituation
function is a Gaussian field G(x, y) centered in the
field of view with peak amplitude of 255 (to remain consistent
with the other 8-bit values) and σ = 50 pixels. It
is combined linearly with the other feature maps using
the weight

$$w = W \cdot \max\left(-1,\; 1 - \frac{\Delta t}{\tau}\right) \tag{7}$$

where w is the weight, Δt is the time since the last habituation
reset, τ is a time constant, and W is the maximum
habituation gain. Whenever the eyes move, the habituation
function is reset, forcing w to W and amplifying
the saliency of central objects until a time τ when w = 0
and there is no influence from the habituation map. As
time progresses, w decays to a minimum value of -W
(reached at Δt = 2τ), which suppresses the saliency of
central objects. In the current implementation, we use a
value of W = 10 and a time constant τ = 5 seconds.
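
Equation 7 and the Gaussian field can be written directly. The function names are illustrative, and placing the field center at (31.5, 31.5), the middle of a 64 x 64 image, is an assumption; W = 10, τ = 5 seconds, σ = 50 pixels, and the peak of 255 are the values given above.

```python
import math

def habituation_weight(dt, W=10.0, tau=5.0):
    """Equation 7: weight of the habituation map dt seconds after a reset."""
    return W * max(-1.0, 1.0 - dt / tau)

def habituation_value(x, y, cx=31.5, cy=31.5, sigma=50.0, peak=255.0):
    """Gaussian field G(x, y) centered at (cx, cy) with the given peak."""
    d2 = (x - cx) ** 2 + (y - cy) ** 2
    return peak * math.exp(-d2 / (2.0 * sigma ** 2))
```

With these values the weight starts at 10 immediately after an eye movement, crosses zero at 5 seconds, and stays at -10 from 10 seconds onward, so a fixated object is first favored and then increasingly suppressed.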

The entire attention process (with habituation) operates
at 10-25 Hz on a single DSP processor node. The
speed varies with the number of attention activation pixels
that pass threshold for region growing. While this
code could be optimized further, rates above 10 Hz are
not necessary for our current purposes.

5.4  Motivations  and  Behaviors  Influence 
Feature  Map  Gains 

Kismet’s drives and behaviors bias the attentional gains
based on the current internal context to preferentially attend
to behaviorally relevant stimuli. Behaviors that satiate
the stimulation drive influence the color saliency
gain because color is characteristic of toys. Similarly, the
face saliency gain is adjusted when the robot is tending
to its social drive. Active level 1 behaviors influence
attentional gains in proportion to the intensity of the
associated drive.

As shown in Figure 3, the face gain is enhanced when
the seek people behavior is active and is suppressed
when the avoid people behavior is active. Similarly,
the color gain is enhanced when the seek toys behavior
is active, and suppressed when the avoid toys behavior
is active. Whenever the engage people or engage toys
behaviors are active, the face and color gains are restored
to their default values, respectively. Weight adjustments
are constrained such that the total sum of the weights
remains constant at all times. Figure 4 illustrates how
the face, motion, and color gains are adjusted as a function
of drive intensity, the active level 1 behavior, and
the nature and quality of the perceptual stimulus.
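
One way to adjust a single gain while keeping the total constant is sketched below. The text states the constant-sum constraint but not how the remaining gains absorb a change; rescaling them proportionally is an assumption, as are the function name and the idea of passing the adjustment as a signed `delta` (which a caller might make proportional to drive intensity).

```python
DEFAULT_GAINS = {"color": 200.0, "motion": 40.0, "face": 50.0}

def bias_gain(gains, feature, delta):
    """Return new gains with `feature` changed by delta and the sum preserved.

    A positive delta enhances the feature (a seek behavior); a negative
    delta suppresses it (an avoid behavior). The other gains are rescaled
    proportionally so that the total stays constant.
    """
    total = sum(gains.values())
    new_value = min(max(gains[feature] + delta, 0.0), total)
    rest = total - gains[feature]
    scale = (total - new_value) / rest if rest else 0.0
    return {name: (new_value if name == feature else value * scale)
            for name, value in gains.items()}
```

Restoring the defaults when an engage behavior becomes active then amounts to replacing the current gains with a copy of `DEFAULT_GAINS`.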

6  Results  and  Evaluation 

Top-down gain adjustments combine with bottom-up habituation
effects to bias the robot’s gaze preference (see
Figure 5). When the seek people behavior is active,
the face gain is enhanced and the robot prefers to look
at a face over a colorful toy. The robot eventually habituates
to the face stimulus and switches gaze briefly
to the toy stimulus. Once the robot has moved its gaze
away from the face stimulus, the habituation is reset and
the robot rapidly re-acquires the face. In one set of behavioral
trials when seek people was active, the robot
spent 80% of the time looking at the face. A similar effect
can be seen when the seek toy behavior is active:
the robot prefers to look at a toy over a face 83% of
the time.

Figure 5: Preferential looking based on habituation and top-down influences. When presented with two salient
stimuli (a face and a brightly colored toy), the robot prefers to look at the stimulus that has behavioral relevance.
Habituation causes the robot to also spend time looking at the non-preferred stimulus.

The opposite effect is apparent when the avoid
people behavior is active. In this case, the face gain
is suppressed so that faces become less salient and are
more rapidly affected by habituation. Because the toy is
relatively more salient than the face, it takes longer for
the robot to habituate to it. Overall, the robot looks at faces
only 5% of the time when in this behavioral context. A
similar scenario holds when the robot’s avoid toy behavior
is active: the robot looks at toys only 24% of
the time.

7  Future  Work 

In this paper we have demonstrated an attentional system
that combines bottom-up perceptions and habituation
effects with top-down behavioral and motivational
influences. This results in a system that directs eye gaze
based on current task demands. In the future, we intend
to construct a richer set of perceptual inputs (depth,
orientation, and texture) and motor responses (smooth
pursuit tracking, vergence, and vestibulo-ocular reflex).
We are also currently combining this system with expressive
behaviors to facilitate social interaction with a
human.